Measures of Relative Dispersion and Moments
Coefficient of Variation: Definition and Calculation
Need for Relative Dispersion
Absolute measures of dispersion, such as the range, mean deviation, variance ($\sigma^2$), and standard deviation ($\sigma$), express the spread of data in the original units of measurement (or, for the variance, in squared units). While these measures are useful for understanding the variability within a single dataset, they pose challenges when comparing the variability of two or more datasets:
- Different Units: If datasets have different units (e.g., comparing the variability in heights measured in centimetres (cm) and weights measured in kilograms (kg)), their standard deviations cannot be directly compared because the units are different.
- Different Average Values: Even if datasets have the same units, comparing their standard deviations can be misleading if their means are significantly different. For instance, a standard deviation of ₹10,000 in the salaries of top executives (average salary of, say, ₹1,00,00,000) represents relatively less variability than a standard deviation of ₹10,000 in the salaries of entry-level employees (average salary of, say, ₹3,00,000). In the first case, the deviation is a small fraction of the mean, while in the second, it is a much larger fraction.
To overcome these limitations and allow for a meaningful comparison of variability between datasets with different units or different average magnitudes, we use relative measures of dispersion. These measures express dispersion as a ratio or percentage, which is unitless and relates the measure of spread to a measure of central tendency.
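To put numbers on the salary illustration above, taking the average salaries quoted there as purely illustrative figures, the standard deviation as a percentage of the mean works out as:
Executives: $\frac{10{,}000}{1{,}00{,}00{,}000} \times 100\% = 0.1\%$
Entry-level employees: $\frac{10{,}000}{3{,}00{,}000} \times 100\% \approx 3.33\%$
The same absolute spread of ₹10,000 is therefore roughly 33 times larger relative to the mean in the second group.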
Definition of Coefficient of Variation
The most widely used relative measure of dispersion is the Coefficient of Variation (CV). It is defined as the ratio of the standard deviation ($\sigma$) to the arithmetic mean ($\bar{x}$), usually expressed as a percentage.
Because it expresses the spread as a fraction of the mean, the CV is a standardized measure of dispersion that allows us to compare the degree of variability between datasets that may have different means or units of measurement.
Calculation
The formula for the Coefficient of Variation (CV) is:
CV $= \frac{\sigma}{\bar{x}} \times 100\%$
... (1)
Where:
- $\sigma$ is the standard deviation of the dataset.
- $\bar{x}$ is the arithmetic mean of the dataset.
- The result is multiplied by $100\%$ to express the CV as a percentage.
The mean ($\bar{x}$) must not be zero when calculating the CV, since the ratio is then undefined; a mean very close to zero also makes the CV extremely large and unstable. The CV is typically used for data measured on a ratio scale (where zero indicates the complete absence of the quantity) and where the mean is positive. If the mean is negative, the CV itself comes out negative and is hard to interpret as a measure of spread.
Steps to Calculate Coefficient of Variation:
- Calculate the arithmetic mean ($\bar{x}$) of the data using the appropriate method (ungrouped or grouped data).
- Calculate the standard deviation ($\sigma$) of the data using the appropriate method (ungrouped or grouped data), taking the positive square root of the variance.
- Divide the standard deviation ($\sigma$) by the mean ($\bar{x}$).
- Multiply the resulting ratio by 100 to express the Coefficient of Variation as a percentage.
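The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `coefficient_of_variation` and the sample heights are assumptions made up for this example, and the standard deviation divides by $n$ (population form) to match $\sigma$ as used here.

```python
import math

def coefficient_of_variation(data):
    """Return the CV of `data` as a percentage (hypothetical helper).

    Uses the population standard deviation (divides by n).
    """
    n = len(data)
    if n == 0:
        raise ValueError("data must not be empty")

    # Step 1: arithmetic mean
    mean = sum(data) / n
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")

    # Step 2: standard deviation = positive square root of the variance
    variance = sum((x - mean) ** 2 for x in data) / n
    sigma = math.sqrt(variance)

    # Steps 3 and 4: divide by the mean and express as a percentage
    return sigma / mean * 100

# Made-up illustrative heights in cm (not data from the text)
heights_cm = [150, 155, 160, 165, 170]
cv = coefficient_of_variation(heights_cm)
print(f"CV = {cv:.2f}%")
```

If the data are a sample rather than a whole population, one might divide by $n - 1$ instead; that choice changes the value slightly but not the procedure.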
Example
Example 1. The mean and standard deviation of the heights of a group of students are 160 cm and 8 cm, respectively. The mean and standard deviation of their weights are 55 kg and 5.5 kg, respectively. Calculate the coefficient of variation for both heights and weights.
Answer:
Given:
- For Heights: Mean ($\bar{x}_H$) = 160 cm, Standard Deviation ($\sigma_H$) = 8 cm.
- For Weights: Mean ($\bar{x}_W$) = 55 kg, Standard Deviation ($\sigma_W$) = 5.5 kg.
To Calculate: Coefficient of Variation (CV) for heights and weights.
Solution:
Using the formula CV $= \frac{\sigma}{\bar{x}} \times 100\%$:
For Heights:
CV (Height) $= \frac{\sigma_H}{\bar{x}_H} \times 100\%$
... (i)
CV (Height) $= \frac{8 \text{ cm}}{160 \text{ cm}} \times 100\%$
CV (Height) $= \frac{\cancel{8}^{1}}{\cancel{160}_{20}} \times 100\%$
(Cancelling the fraction)
CV (Height) $= \frac{1}{20} \times 100\%$
CV (Height) $= 0.05 \times 100\%$
CV (Height) $= 5\%$
... (ii)
For Weights:
CV (Weight) $= \frac{\sigma_W}{\bar{x}_W} \times 100\%$
... (iii)
CV (Weight) $= \frac{5.5 \text{ kg}}{55 \text{ kg}} \times 100\%$
CV (Weight) $= \frac{5.5}{55} \times 100\%$
CV (Weight) $= \frac{55/10}{55} \times 100\% = \frac{55}{10 \times 55} \times 100\%$
CV (Weight) $= \frac{1}{10} \times 100\%$
CV (Weight) $= 0.10 \times 100\%$
CV (Weight) $= 10\%$
... (iv)
The Coefficient of Variation for height is 5%, and for weight is 10%.
Comparing Variability using Coefficient of Variation
Principle of Comparison
The primary use of the Coefficient of Variation (CV) is to provide a standardized measure of dispersion that can be used to compare the relative variability of different datasets. Since the CV is a ratio of two quantities measured in the same units ($\sigma / \bar{x}$), the units cancel and the result is unitless, whether it is written as a ratio or as a percentage. This makes it possible to compare variability even when the datasets have different units of measurement (like height in cm and weight in kg) or when they have vastly different average values (like salaries of different professions).
- A lower Coefficient of Variation (CV) indicates that the standard deviation is relatively small compared to the mean. This suggests that the data points are less spread out relative to their average, implying greater consistency or homogeneity within the dataset.
- A higher Coefficient of Variation (CV) indicates that the standard deviation is relatively large compared to the mean. This suggests that the data points are more spread out relative to their average, implying greater variability or heterogeneity within the dataset.
Therefore, to compare the variability or consistency of two or more datasets, we calculate the CV for each dataset and compare these CV values. The dataset with the lowest CV is considered the most consistent or least variable in relative terms.
Applications of CV for Comparison
The Coefficient of Variation is widely used in various fields for comparing relative variability:
- Comparing Consistency: In sports, comparing the consistency of scores of different players or teams; in education, comparing the consistency of performance between different classes; in manufacturing, comparing the consistency of product quality between different machines or shifts.
- Investment and Finance: Assessing the risk of investments relative to their expected returns. The CV can be used as a measure of risk per unit of return, so a lower CV suggests less risk for each unit of average return.
- Quality Control: Comparing the relative variability of different processes or products to improve quality and reduce defects.
- Biological Sciences: Comparing the relative variation in biological measurements (e.g., size, weight, response to treatment) across different species, populations, or experimental groups.
Example
Example 1. Using the results from Example 1 in the previous section, which measurement (height or weight) shows greater variability relative to its mean for the group of students?
Answer:
Given: Coefficient of Variation for Height = 5%, Coefficient of Variation for Weight = 10% (from Example 1 in the previous section).
To Compare: Relative variability of height vs. weight.
Solution:
To compare relative variability, we compare the Coefficients of Variation:
- CV (Height) = 5%
- CV (Weight) = 10%
Since $10\% > 5\%$, the Coefficient of Variation for weight is higher than that for height.
A higher CV indicates greater variability relative to the mean.
Therefore, weight shows greater variability relative to its mean than height does for this group of students.
Interpretation:
Although the standard deviation of height (8 cm) is numerically larger than the standard deviation of weight (5.5 kg), the standard deviation of weight is a larger fraction (or percentage) of the average weight compared to how the standard deviation of height relates to the average height. This indicates that the weights of students in this group are relatively more spread out around their average weight than their heights are around their average height.
Example 2. Two factories, A and B, produce electric bulbs. A sample of bulbs from Factory A has a mean lifetime of 2000 hours and a standard deviation of 200 hours. A sample of bulbs from Factory B has a mean lifetime of 1800 hours and a standard deviation of 144 hours. Which factory produces bulbs with greater consistency in lifetime?
Answer:
Given:
- Factory A: $\bar{x}_A = 2000$ hours, $\sigma_A = 200$ hours.
- Factory B: $\bar{x}_B = 1800$ hours, $\sigma_B = 144$ hours.
To Determine: Which factory produces bulbs with greater consistency (less relative variability).
Solution:
To compare consistency, we calculate the Coefficient of Variation (CV) for each factory.
Using the formula CV $= \frac{\sigma}{\bar{x}} \times 100\%$:
For Factory A:
CV$_A = \frac{\sigma_A}{\bar{x}_A} \times 100\%$
... (i)
CV$_A = \frac{200 \text{ hours}}{2000 \text{ hours}} \times 100\%$
CV$_A = \frac{\cancel{200}^{1}}{\cancel{2000}_{10}} \times 100\%$
CV$_A = \frac{1}{10} \times 100\%$
CV$_A = 10\%$
... (ii)
For Factory B:
CV$_B = \frac{\sigma_B}{\bar{x}_B} \times 100\%$
... (iii)
CV$_B = \frac{144 \text{ hours}}{1800 \text{ hours}} \times 100\%$
CV$_B = \frac{\cancel{144}^{1}}{\cancel{1800}_{12.5}} \times 100\%$
($1800 \div 144 = 12.5$)
CV$_B = \frac{1}{12.5} \times 100\%$
CV$_B = 0.08 \times 100\%$
($1/12.5 = 1/(25/2) = 2/25 = 0.08$)
CV$_B = 8\%$
... (iv)
Comparison:
CV$_A = 10\%$ and CV$_B = 8\%$.
Since $8\% < 10\%$, Factory B has a lower Coefficient of Variation than Factory A.
A lower CV indicates greater consistency (less relative variability).
Therefore, Factory B produces bulbs with greater consistency in lifetime compared to Factory A.
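The comparison in this example can also be checked with a few lines of Python. This is only a sketch working from the summary statistics given above; the helper name `cv_percent` and the dictionary layout are assumptions chosen for illustration.

```python
def cv_percent(sigma, mean):
    """CV as a percentage from a standard deviation and a mean (hypothetical helper)."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sigma / mean * 100

# (mean lifetime in hours, standard deviation in hours) from Example 2
factories = {
    "A": (2000, 200),
    "B": (1800, 144),
}

# Compute the CV for every factory
cvs = {name: cv_percent(sigma, mean) for name, (mean, sigma) in factories.items()}

# The lowest CV marks the most consistent dataset in relative terms
most_consistent = min(cvs, key=cvs.get)

for name, value in cvs.items():
    print(f"Factory {name}: CV = {value:.1f}%")
print(f"Most consistent: Factory {most_consistent}")
```

The same pattern extends to any number of datasets: compute each CV, then pick the minimum.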
Moments (Raw and Central - Implicit Introduction)
Beyond Mean and Variance: Describing Shape
While measures of central tendency (like mean, median, mode) describe where the data is centered, and measures of dispersion (like variance, standard deviation, range) describe how spread out the data is, these two types of measures alone do not fully characterize a frequency distribution. Distributions can have the same mean and variance but differ in their shape – specifically, their asymmetry (skewness) and peakedness (kurtosis).
To describe these higher-order characteristics of a distribution's shape, statisticians use quantities called **moments**. Moments provide a more complete set of summary statistics that can describe various features of a probability distribution or a dataset.
Raw Moments (Moments About the Origin)
The **$k^{\text{th}}$ raw moment** of a dataset is defined as the arithmetic mean of the $k^{\text{th}}$ powers of the observations. It is also called the moment about the origin because it's calculated relative to zero.
For a dataset of $n$ individual observations $x_1, x_2, \dots, x_n$, the $k^{\text{th}}$ raw moment, denoted by $m'_k$ or $\mu'_k$, is given by:
$m'_k = \frac{\sum\limits_{i=1}^{n} x_i^k}{n}$
... (1)
For a frequency distribution with distinct values or class marks $x_1, x_2, \dots, x_m$ and frequencies $f_1, f_2, \dots, f_m$, and total frequency $N = \sum f_i$, the $k^{\text{th}}$ raw moment is:
$m'_k = \frac{\sum\limits_{i=1}^{m} f_i x_i^k}{N}$
... (2)
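As a rough illustration of formulas (1) and (2), the sketch below computes raw moments for both ungrouped data and a frequency distribution. The function name `raw_moment`, its optional `freqs` parameter, and the sample values are assumptions introduced for this example.

```python
def raw_moment(values, k, freqs=None):
    """k-th raw moment (about the origin); hypothetical helper.

    values : observations, or distinct values / class marks
    freqs  : optional frequencies; if omitted, each value counts once
    """
    if freqs is None:
        freqs = [1] * len(values)
    total = sum(freqs)  # n for ungrouped data, N for a frequency distribution
    if total == 0:
        raise ValueError("no observations")
    return sum(f * x ** k for x, f in zip(values, freqs)) / total

# Made-up data: the same observations written two ways
ungrouped = [2, 4, 4, 6]
marks, counts = [2, 4, 6], [1, 2, 1]

first_raw = raw_moment(ungrouped, 1)        # arithmetic mean
second_raw = raw_moment(ungrouped, 2)       # mean of squares
second_raw_grouped = raw_moment(marks, 2, counts)

print(first_raw, second_raw, second_raw_grouped)
```

Both forms give the same result here because the frequency table is just a compact way of writing the same four observations.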
Special Cases:
First Raw Moment ($k=1$):
$m'_1 = \frac{\sum x_i^1}{n} = \frac{\sum x_i}{n} = \bar{x}$. The first raw moment is simply the **arithmetic mean** of the dataset.
Second Raw Moment ($k=2$):
$m'_2 = \frac{\sum x_i^2}{n}$. This is the mean of the squares of the observations. It is used in the computational formula for variance ($\sigma^2 = m'_2 - (m'_1)^2$).
Higher Raw Moments ($k>2$):
Higher raw moments are less commonly interpreted directly but are used to calculate central moments.
Central Moments (Moments About the Mean)
The **$k^{\text{th}}$ central moment** of a dataset is defined as the arithmetic mean of the $k^{\text{th}}$ powers of the deviations of the observations from the **mean** ($\bar{x}$). Central moments are more informative about the shape of the distribution relative to its center.
For a dataset of $n$ individual observations $x_1, x_2, \dots, x_n$ with mean $\bar{x}$, the $k^{\text{th}}$ central moment, denoted by $m_k$ or $\mu_k$, is given by:
$m_k = \frac{\sum\limits_{i=1}^{n} (x_i - \bar{x})^k}{n}$
... (3)
For a frequency distribution with distinct values or class marks $x_i$ and frequencies $f_i$, and total frequency $N = \sum f_i$ and mean $\bar{x}$, the $k^{\text{th}}$ central moment is:
$m_k = \frac{\sum\limits_{i=1}^{m} f_i (x_i - \bar{x})^k}{N}$
... (4)
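Formulas (3) and (4) can be sketched in the same way. The helper below is an illustrative assumption (the name `central_moment`, the `freqs` parameter, and the data are not from the text); it first computes the mean and then averages the $k^{\text{th}}$ powers of the deviations from it.

```python
def central_moment(values, k, freqs=None):
    """k-th central moment (about the mean); hypothetical helper.

    values : observations, or distinct values / class marks
    freqs  : optional frequencies; if omitted, each value counts once
    """
    if freqs is None:
        freqs = [1] * len(values)
    total = sum(freqs)
    if total == 0:
        raise ValueError("no observations")
    mean = sum(f * x for x, f in zip(values, freqs)) / total
    return sum(f * (x - mean) ** k for x, f in zip(values, freqs)) / total

# Made-up, symmetric illustrative data
data = [2, 4, 4, 6]
m1 = central_moment(data, 1)
m2 = central_moment(data, 2)
m3 = central_moment(data, 3)
m4 = central_moment(data, 4)
print(m1, m2, m3, m4)
```

For this small symmetric dataset the odd central moments come out as zero, and the second central moment equals the variance.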
Special Cases:
First Central Moment ($k=1$):
$m_1 = \frac{\sum (x_i - \bar{x})^1}{n} = \frac{\sum (x_i - \bar{x})}{n}$. By a property of the mean, the sum of deviations from the mean is always zero ($\sum (x_i - \bar{x}) = 0$). Therefore, the first central moment is always $m_1 = 0$.
Second Central Moment ($k=2$):
$m_2 = \frac{\sum (x_i - \bar{x})^2}{n}$. This is the average of the squared deviations from the mean. This is exactly the definition of the **variance**, $\sigma^2$. So, $m_2 = \sigma^2$.
Third Central Moment ($k=3$):
$m_3 = \frac{\sum (x_i - \bar{x})^3}{n}$. This moment measures the **skewness** or asymmetry of the distribution.
- If the distribution is symmetric, $m_3 = 0$ (the converse does not always hold: an asymmetric distribution can also have $m_3 = 0$).
- If $m_3 > 0$, the distribution is positively skewed (skewed right).
- If $m_3 < 0$, the distribution is negatively skewed (skewed left).
Fourth Central Moment ($k=4$):
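$m_4 = \frac{\sum (x_i - \bar{x})^4}{n}$. This moment is the basis for measuring **kurtosis**, the peakedness/flatness and tail weight of the distribution compared to a normal distribution. Because $m_4$ depends on the scale of the data, it is standardized by dividing by $\sigma^4$: the measure of kurtosis based on the fourth moment is $\gamma_2 = m_4 / \sigma^4 - 3$. For a normal distribution $\gamma_2 = 0$; a positive $\gamma_2$ indicates a more peaked distribution with heavier tails, while a negative value indicates a flatter distribution with lighter tails.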
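To see the third and fourth central moments at work, the sketch below applies them to a small made-up dataset with one large value on the right. The helper `central_moment`, the variable names, and the data are assumptions for illustration only; the kurtosis measure is the $\gamma_2 = m_4 / \sigma^4 - 3$ quoted above, with $\sigma^4 = (m_2)^2$.

```python
def central_moment(data, k):
    """k-th central moment of ungrouped data (hypothetical helper)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n

# Made-up data: most values are small, one is much larger
data = [1, 2, 2, 3, 12]

m2 = central_moment(data, 2)  # variance
m3 = central_moment(data, 3)  # sign indicates direction of skew
m4 = central_moment(data, 4)

# Kurtosis measure from the fourth moment: gamma_2 = m4 / sigma^4 - 3
excess_kurtosis = m4 / m2 ** 2 - 3

if m3 > 0:
    shape = "positively skewed"
elif m3 < 0:
    shape = "negatively skewed"
else:
    shape = "no skew detected by m3"

print(f"m2 = {m2:.2f}, m3 = {m3:.2f}, m4 = {m4:.2f}")
print(f"gamma_2 = {excess_kurtosis:.4f} ({shape})")
```

The single large observation pulls the cubed deviations to the positive side, which is why $m_3$ comes out positive for this dataset.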
Relationship between Raw and Central Moments:
Central moments can be expressed in terms of raw moments. Some common relationships are:
- $m_1 = m'_1 - m'_1 = 0$ (as shown above)
- $m_2 = m'_2 - (m'_1)^2$. This is the same as the computational formula for variance: $\sigma^2 = \frac{\sum x^2}{n} - (\bar{x})^2$.
- $m_3 = m'_3 - 3 m'_2 m'_1 + 2 (m'_1)^3$
- $m_4 = m'_4 - 4 m'_3 m'_1 + 6 m'_2 (m'_1)^2 - 3 (m'_1)^4$
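The relationships above can be checked numerically. The sketch below computes the central moments directly and again from the raw moments, then compares the two; the helper names `raw` and `central` and the dataset are assumptions made up for this check.

```python
import math

def raw(data, k):
    """k-th raw moment of ungrouped data (hypothetical helper)."""
    return sum(x ** k for x in data) / len(data)

def central(data, k):
    """k-th central moment of ungrouped data (hypothetical helper)."""
    mean = raw(data, 1)
    return sum((x - mean) ** k for x in data) / len(data)

data = [1, 2, 2, 3, 12]  # made-up illustrative values

r1, r2, r3, r4 = (raw(data, k) for k in (1, 2, 3, 4))

# Central moments rebuilt from raw moments using the listed relationships
m2_from_raw = r2 - r1 ** 2
m3_from_raw = r3 - 3 * r2 * r1 + 2 * r1 ** 3
m4_from_raw = r4 - 4 * r3 * r1 + 6 * r2 * r1 ** 2 - 3 * r1 ** 4

# Compare with the direct definitions (allowing for floating-point rounding)
checks = {
    "m2": math.isclose(m2_from_raw, central(data, 2)),
    "m3": math.isclose(m3_from_raw, central(data, 3)),
    "m4": math.isclose(m4_from_raw, central(data, 4)),
}
print(checks)
```

The comparison uses `math.isclose` rather than `==` because the two routes perform different floating-point operations and may differ in the last few digits.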
Moments provide a more comprehensive mathematical description of a distribution's shape and properties. The lower-order moments (mean and variance) are related to location and spread, while higher-order moments (especially the third and fourth) are used to quantify asymmetry and peakedness, which are key aspects of describing the shape of a distribution.